{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example and timings\n",
    "\n",
    "This notebook gives a short introduction in how to use pydpc for a simple clustering problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib as mpl\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from pydpc import Cluster\n",
    "from pydpc._reference import Cluster as RefCluster"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start with preparing the data points for clustering. The data is two-dimensional and craeted by drawing random numbers from four superpositioned gaussian distributions which are centered at the corners of a square (indicated by the red dashed lines)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# generate the data points\n",
    "npoints = 2000\n",
    "mux = 1.6\n",
    "muy = 1.6\n",
    "points = np.zeros(shape=(npoints, 2), dtype=np.float64)\n",
    "points[:, 0] = np.random.randn(npoints) + mux * (-1)**np.random.randint(0, high=2, size=npoints)\n",
    "points[:, 1] = np.random.randn(npoints) + muy * (-1)**np.random.randint(0, high=2, size=npoints)\n",
    "# draw the data points\n",
    "fig, ax = plt.subplots(figsize=(5, 5))\n",
    "ax.scatter(points[:, 0], points[:, 1], s=40)\n",
    "ax.plot([-mux, -mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color=\"red\")\n",
    "ax.plot([mux, mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color=\"red\")\n",
    "ax.plot([-1.5 * mux,  1.5 * mux], [-muy, -muy], '--', linewidth=2, color=\"red\")\n",
    "ax.plot([-1.5 * mux,  1.5 * mux], [muy, muy], '--', linewidth=2, color=\"red\")\n",
    "ax.set_xlabel(r\"x / a.u.\", fontsize=20)\n",
    "ax.set_ylabel(r\"y / a.u.\", fontsize=20)\n",
    "ax.tick_params(labelsize=15)\n",
    "ax.set_xlim([-7, 7])\n",
    "ax.set_ylim([-7, 7])\n",
    "ax.set_aspect('equal')\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now comes the interesting part.\n",
    "\n",
    "We pass the numpy ndarray with the data points to the ``Cluster`` class which prepares the data set for clustering. In this stage, it computes the Euclidean distances between all data points and from that the two properties to identify clusters within the data: each data points' ``density`` and minimal distance ``delta`` to a point of higher density.\n",
    "\n",
    "Once these properties are computed, a decision graph is drawn, where each outlier in the upper right corner represents a different cluster. In our example, we should find four outliers. So far, however, no clustering has yet been done."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clu = Cluster(points)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the decision graph, we can select the outliers via the ``assign`` method by setting lower bounds for ``delta`` and ``density``. The assign method does the actual clustering; it also shows the decision graph again with the given selection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clu.assign(20, 1.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us have a look at the result.\n",
    "\n",
    "We again plot the data and red dashed lines indicating the centeres of the gaussian distributions. Indicated in the left panel by red dots are the four outliers from the decision graph; these are our four cluster centers. The center panel shows the points' densities and the right panel shows the membership to the four clusters by different coloring."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(1, 3, figsize=(15, 5))\n",
    "ax[0].scatter(points[:, 0], points[:, 1], s=40)\n",
    "ax[0].scatter(points[clu.clusters, 0], points[clu.clusters, 1], s=50, c=\"red\")\n",
    "ax[1].scatter(points[:, 0], points[:, 1], s=40, c=clu.density)\n",
    "ax[2].scatter(points[:, 0], points[:, 1], s=40, c=clu.membership, cmap=mpl.cm.cool)\n",
    "for _ax in ax:\n",
    "    _ax.plot([-mux, -mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color=\"red\")\n",
    "    _ax.plot([mux, mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color=\"red\")\n",
    "    _ax.plot([-1.5 * mux,  1.5 * mux], [-muy, -muy], '--', linewidth=2, color=\"red\")\n",
    "    _ax.plot([-1.5 * mux,  1.5 * mux], [muy, muy], '--', linewidth=2, color=\"red\")\n",
    "    _ax.set_xlabel(r\"x / a.u.\", fontsize=20)\n",
    "    _ax.set_ylabel(r\"y / a.u.\", fontsize=20)\n",
    "    _ax.tick_params(labelsize=15)\n",
    "    _ax.set_xlim([-7, 7])\n",
    "    _ax.set_ylim([-7, 7])\n",
    "    _ax.set_aspect('equal')\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The density peak clusterng can further resolve if the membership of a data point to a certain cluster is strong or rather weak and separates the data points further into core and halo regions.\n",
    "\n",
    "The left panel depicts the border members in grey.\n",
    "The separation in the center panel uses the core/halo criterion of the original authors, the right panel shows a less strict criterion which assumes a halo only between different clusters; here, the halo members are depicted in grey."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(1, 3, figsize=(15, 5))\n",
    "ax[0].scatter(\n",
    "    points[:, 0], points[:, 1],\n",
    "    s=40, c=clu.membership, cmap=mpl.cm.cool)\n",
    "ax[0].scatter(points[clu.border_member, 0], points[clu.border_member, 1], s=40, c=\"grey\")\n",
    "ax[1].scatter(\n",
    "    points[clu.core_idx, 0], points[clu.core_idx, 1],\n",
    "    s=40, c=clu.membership[clu.core_idx], cmap=mpl.cm.cool)\n",
    "ax[1].scatter(points[clu.halo_idx, 0], points[clu.halo_idx, 1], s=40, c=\"grey\")\n",
    "clu.autoplot=False\n",
    "clu.assign(20, 1.5, border_only=True)\n",
    "ax[2].scatter(\n",
    "    points[clu.core_idx, 0], points[clu.core_idx, 1],\n",
    "    s=40, c=clu.membership[clu.core_idx], cmap=mpl.cm.cool)\n",
    "ax[2].scatter(points[clu.halo_idx, 0], points[clu.halo_idx, 1], s=40, c=\"grey\")\n",
    "ax[2].tick_params(labelsize=15)\n",
    "for _ax in ax:\n",
    "    _ax.plot([-mux, -mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color=\"red\")\n",
    "    _ax.plot([mux, mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color=\"red\")\n",
    "    _ax.plot([-1.5 * mux,  1.5 * mux], [-muy, -muy], '--', linewidth=2, color=\"red\")\n",
    "    _ax.plot([-1.5 * mux,  1.5 * mux], [muy, muy], '--', linewidth=2, color=\"red\")\n",
    "    _ax.set_xlabel(r\"x / a.u.\", fontsize=20)\n",
    "    _ax.set_ylabel(r\"y / a.u.\", fontsize=20)\n",
    "    _ax.tick_params(labelsize=15)\n",
    "    _ax.set_xlim([-7, 7])\n",
    "    _ax.set_ylim([-7, 7])\n",
    "    _ax.set_aspect('equal')\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This concludes the example.\n",
    "\n",
    "In the remaining part, we address the performance of the pydpc implementation (numpy + cython-wrapped C code) with respect to an older development version (numpy). In particular, we look at the numerically most demanding part of computing the Euclidean distances between the data points and estimating density and delta."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "npoints = 1000\n",
    "points = np.zeros(shape=(npoints, 2), dtype=np.float64)\n",
    "points[:, 0] = np.random.randn(npoints) + 1.8 * (-1)**np.random.randint(0, high=2, size=npoints)\n",
    "points[:, 1] = np.random.randn(npoints) + 1.8 * (-1)**np.random.randint(0, high=2, size=npoints)\n",
    "\n",
    "%timeit Cluster(points, fraction=0.02, autoplot=False)\n",
    "%timeit RefCluster(fraction=0.02, autoplot=False).load(points)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next two cells measure the full clustering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%timeit\n",
    "Cluster(points, fraction=0.02, autoplot=False).assign(20, 1.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%timeit\n",
    "tmp = RefCluster(fraction=0.02, autoplot=False)\n",
    "tmp.load(points)\n",
    "tmp.assign(20, 1.5)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
